A Few Things You Should Know Before Developing Generative AI Applications with Local Models
I see it all over the Internet, on social media, on YouTube… the whole world is raving about the possibilities of generative AI, and generative AI applications built on the OpenAI and Anthropic APIs are sprouting up everywhere.
But these kinds of applications are “easy” to develop because you delegate the “AI” part to a third party (OpenAI, Anthropic, …). And you don’t need to worry about infrastructure, scalability, maintenance, etc.
- Integration boils down to a few lines of code: an API key, an HTTP endpoint, and a JSON payload (see the sketch just after this list). No hardware configuration, no complex dependency management. Official SDKs (Python, JavaScript, etc.) simplify development even further.
- Model updates, security patches, and performance optimizations are handled by the provider. You automatically benefit from improvements without any effort on your part.
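To give you an idea of what those “few lines of code” look like, here is a minimal sketch of a call to a cloud provider (OpenAI’s chat completions endpoint in this case; the model name is just an example):

```python
# Minimal sketch: an API key, an HTTP endpoint, and a JSON payload.
import os
import requests

response = requests.post(
    "https://api.openai.com/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
    json={
        "model": "gpt-4o-mini",  # example model name
        "messages": [{"role": "user", "content": "Hello!"}],
    },
)
print(response.json()["choices"][0]["message"]["content"])
```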
And that’s pretty cool.
But when you want to develop generative AI applications with local models (i.e. hosted on your own servers or machines), because you have privacy constraints, cost concerns, latency requirements, or simply because you want more control over the models (I do mean the models, not the model), there are a few important things to consider.
And if on top of that it’s on your local machine, that’s yet another level of complexity (assuming you do have at least some GPU power).
Developing with Local Models: Advantages and Constraints
I’ll only talk today about the “purely local” use case — everything will happen on your machine.
Note: I use a MacBook Air M4 for personal needs and a MacBook Pro M2 Max for work, so I benefit from Apple Silicon’s Metal hardware acceleration. But this blog post remains valid for any other hardware with GPU (just adapt according to your hardware capabilities).
Tools like Docker Model Runner, Kronk, Ollama … allow you to run language models directly on your machine, leveraging the GPU (not all GPUs, but with Apple Silicon or recent NVIDIA GPUs, it’s possible).
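For example, once one of these engines is running, a quick sanity check is to list the models it can serve through its OpenAI-compatible API (a sketch assuming Ollama’s default local endpoint; adapt the base_url for Docker Model Runner or another runtime):

```python
# List the models exposed by the local engine (OpenAI-compatible endpoint assumed).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="not-needed")

for model in client.models.list():
    print(model.id)
```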
Advantages
There are of course advantages to using local models (besides the fact that it’s fun and educational):
- Total privacy: data never leaves your machine. This is the major argument for sensitive use cases (medical data, legal documents, personal privacy…)
- Cost: once the hardware is acquired and the model downloaded, each inference is “free”.
- Offline operation: ideal for disconnected environments, travel, or areas with poor connectivity.
- Full control: you choose the models, their versions, their parameters…
- …
But there are important constraints to take into account. And this is where it gets interesting (from the perspective of a developer who likes to code for fun… or have fun while coding 🤔).
Constraints
Performance will be limited by the hardware: I’m fortunate to have 32 GB of RAM on my machines, but in reality, for a smooth user/developer experience, you’d ideally need 64 GB of RAM (or more for very large models). So I generally set my upper limit at 7B to 8B parameter models for my everyday experiments. Most of the time I use models with 0.5B to 4B parameters.
And consequently, the models I use will be less “intelligent” (performant) than larger models.
With 30B+ parameter models, you start getting interesting performance, but that requires more powerful hardware, even though quantization techniques help reduce memory requirements.
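A quick back-of-the-envelope calculation shows why (a rough sketch that only counts the weights, not the KV cache or runtime overhead, and assumes Q4_K_M is roughly 4.85 bits per weight):

```python
# Rough memory estimate for model weights only (no KV cache, no runtime overhead).
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

print(f"{weights_gb(30, 16):.1f} GB")    # 30B in FP16: ~60 GB, out of reach of 32 GB of RAM
print(f"{weights_gb(30, 4.85):.1f} GB")  # 30B quantized (Q4_K_M-style): ~18 GB, still tight
print(f"{weights_gb(3, 4.85):.1f} GB")   # 3B quantized: ~1.8 GB, comfortable
```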
System resources consumed are significant: local inference monopolizes the GPU and consumes a lot of memory. During execution, the machine can become sluggish for other tasks.
And of course you won’t have any scalability.
I have a secret dream of building myself a small server farm with GPUs to run local models internally, but that’s not happening anytime soon… 🤓 I need to find a sponsor 😂.
So we’ll stick today with a use case limited to your local machine, for experiments, prototyping, and discovery, but also for real use cases, because I’m convinced that “small” (7B to 12B) and even “very small” (0.5B to 4B) local models can be very useful in many contexts.
Therefore, to develop useful generative AI applications with local models and “constrained” resources, you need to adopt certain strategies and best practices. And get creative.
Strategies and Optimization Techniques
Facing the hardware constraints (of my machine), several techniques allow us to work around the limitations. The main idea is not to try to replicate the experience of a monolithic cloud model, but to architect our application intelligently and differently.
Multi-Model Architecture: Specialize to Optimize
Rather than using a single large generalist model, the most effective approach is to deploy several specialized models and route requests to the appropriate one.
So we’ll distribute model roles by specialty:
Chat/Conversation models
For this I like using models from the Qwen family. My favorites are qwen2.5 0.5b, qwen2.5 1.5b and qwen2.5 3b (I use the GGUF versions with Docker Model Runner, with Q4_K_M quantization). They’re lightweight and fast, and they handle simple conversational interactions, reformulations, and summaries well. They consume few resources, and I get very decent response times.
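To make this concrete, here is a minimal sketch of a chat call to one of these small models through the OpenAI-compatible API exposed by the local runtime (the base_url and model tag are assumptions; adapt them to Docker Model Runner, Ollama, or whatever engine you use):

```python
# Chat with a small local model through an OpenAI-compatible endpoint.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="qwen2.5:1.5b",  # assumed model tag
    messages=[
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Summarize what a context window is in two sentences."},
    ],
    temperature=0.2,
)
print(response.choices[0].message.content)
```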
Code models
They’re specialized for code generation and analysis: smaller than generalist models, but often more capable in the programming domain. Here too I like the “Qwen Coders”; for example, I use these:
- hf.co/qwen/qwen2.5-coder-1.5b-instruct-gguf:Q4_K_M
- hf.co/qwen/qwen2.5-coder-3b-instruct-gguf:Q4_K_M
I’d really love to use Qwen3-Coder-30B-A3B-Instruct: it’s excellent at both code and “function calling”, which is very interesting for code agent CLIs that rely on a single model. But it’s too big for my current machines…
Which means in my case, I’ll need to delegate “function calling” tasks to another model.
Function calling/tools models
They’re specifically trained to extract structured function calls from natural language. Essential for agents, automation, and of course MCP server usage.
Legend has it that small models aren’t great at “function calling”, but I’ve found two that are actually quite good in this area: Jan Nano and Lucy.
These are my darlings, and I even use them for chat tasks and data extraction (structured output) in my code agents.
✋ Important note: these 2 models support not only tool calls but also parallel tool calls.
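Here is a sketch of what a tool call looks like against a local model served through an OpenAI-compatible endpoint (the base_url, the model name, and the get_weather tool are assumptions, for illustration only):

```python
# Ask a tool-capable local model to emit structured tool calls instead of free text.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="not-needed")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool, for illustration
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="jan-nano",  # assumed local model name
    messages=[{"role": "user", "content": "What's the weather in Lyon and in Paris?"}],
    tools=tools,
)

# One or several tool calls (if the model supports parallel tool calls).
for call in response.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)
```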
Regarding “structured output”, this is the ability of a model to format its responses into specific data structures (JSON, XML, etc.).
Next, to help our little models that don’t have as much knowledge as the big ones, we can use techniques like RAG (Retrieval-Augmented Generation) to provide additional context. So in this specific use case, we’ll use lightweight embedding models for semantic search (similarity search).
Embedding models
They’re very lightweight: they vectorize text for semantic search and RAG, and they can run in parallel without impacting other models. My current choice is mxbai-embed-large-v1.
✋ Important note: embedding models are not all multilingual. Make sure to check this if you work with data in multiple languages. And they can only vectorize relatively short texts (for example 512 to 2048 tokens depending on the model).
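Here is a minimal sketch of semantic similarity with a local embedding model (the endpoint and model name are assumptions; adapt them to your runtime):

```python
# Vectorize two texts with a local embedding model and compare them.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="not-needed")

def embed(text: str) -> list[float]:
    return client.embeddings.create(model="mxbai-embed-large", input=text).data[0].embedding

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(y * y for y in b) ** 0.5
    return dot / (norm_a * norm_b)

print(cosine(embed("How do I reset my password?"),
             embed("Steps to recover account access")))
```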
You can already guess that you’ll need to manage multiple models in your application, or multiple specialized agents, each with its own model. You’ll therefore need a smart router to direct each request to the right model, and an appropriate model for that routing task itself.
Classification/routing models
For my part, I like using the “structured output” concept to detect intents in user messages and thus decide which model to use. There are other possible techniques, but that will be the subject of another blog post.
For this use case, Jan Nano and Lucy do a great job.
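Here is a sketch of that routing idea: a small model classifies the user’s intent as strict JSON, and the result decides which specialized model gets the request (the JSON mode via response_format, the model names, and the routing table are assumptions; check what your runtime and models actually support):

```python
# Detect the intent with structured output, then route to the right model.
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="not-needed")

ROUTES = {"chat": "qwen2.5:3b", "code": "qwen2.5-coder:3b", "tools": "jan-nano"}

def route(user_message: str) -> str:
    completion = client.chat.completions.create(
        model="jan-nano",  # assumed router model
        messages=[
            {"role": "system",
             "content": 'Classify the user intent. Answer only with JSON: {"intent": "chat" | "code" | "tools"}'},
            {"role": "user", "content": user_message},
        ],
        response_format={"type": "json_object"},
    )
    intent = json.loads(completion.choices[0].message.content)["intent"]
    return ROUTES.get(intent, ROUTES["chat"])

print(route("Write a Go function that parses a JSON file"))  # expected: qwen2.5-coder:3b
```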
Context Size Management
Another important point is context size management. Small models often ship with strict default context limits (for example 2048 tokens), although the LLM engine you use usually lets you raise that default, provided the model supports it. So you need to be strategic about context management.
Quick reminder about context:
The Context Size (or Context Window) is the total number of tokens the model can process at once
- Think of it as the model’s short-term memory
- Includes everything: system prompt, user messages, history, documents, AND the generated response
For example, if a model has a 32k context window:
input tokens + history tokens + output tokens ≤ 32,000 tokens
The following techniques can help optimize context usage:
- RAG (Retrieval-Augmented Generation): rather than injecting all context into the prompt, store your documents in a vector database. On each request, search for the most relevant passages and inject only those.
- Context Packing and Compression: at regular intervals, ask the model (or another model) to summarize the past conversation and replace the detailed history with this summary.
- Sliding Window with Memory: keep only the last N messages in detail, plus a summary of earlier messages.
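Here is a sketch of that last technique (the compressor model name and the endpoint are assumptions):

```python
# Sliding window with memory: keep the last N messages verbatim,
# fold everything older into a running summary produced by a small model.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="not-needed")
WINDOW = 6  # number of recent messages kept verbatim

def compress_history(messages: list[dict], summary: str) -> tuple[list[dict], str]:
    if len(messages) <= WINDOW:
        return messages, summary
    old, recent = messages[:-WINDOW], messages[-WINDOW:]
    completion = client.chat.completions.create(
        model="qwen2.5:0.5b",  # assumed compressor model
        messages=[
            {"role": "system",
             "content": "Summarize this conversation in a few sentences, keeping key facts and decisions."},
            {"role": "user",
             "content": summary + "\n" + "\n".join(f"{m['role']}: {m['content']}" for m in old)},
        ],
    )
    return recent, completion.choices[0].message.content

# Before each new request, prepend the summary as a system message:
# [{"role": "system", "content": f"Conversation so far: {summary}"}] + recent
```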
Note on extended-context models: some local models support longer contexts (32K, 128K tokens), but beware: a longer context consumes proportionally more memory and slows down inference.
To Summarize (Quickly)
If you develop applications with a Cloud-based Code Agent approach (e.g. Claude Code), you’ll see this:
```mermaid
flowchart TB
    CLI[Code Agent CLI]
    subgraph Cloud Provider
        API[Cloud API
        e.g. Anthropic API]
        LLM[Single Large Model
        e.g. Claude Sonnet/Opus]
    end
    CLI -->|HTTP request + API key| API
    API -->|Inference| LLM
    LLM -->|Response| API
    API -->|Streamed response| CLI
    style CLI fill:#4a9eff,stroke:#333,color:#fff
    style API fill:#ff9f43,stroke:#333,color:#fff
    style LLM fill:#ee5a24,stroke:#333,color:#fff
```
One model does everything: chat, code generation, tool calling, summarization, RAG… The heavy lifting is offloaded to the cloud provider.
And if you develop with a Local Multi-Agent Code Agent Architecture, you’ll see this:
```mermaid
flowchart TB
    CLI2[Code Agent CLI]
    subgraph Orchestration
        ORCH[Orchestrator
        routing & coordination]
    end
    subgraph Specialized Agents
        CHAT[Chat Agent
        conversation & reformulation
        e.g. Qwen 2.5 3B]
        CODER[Coder Agent
        code generation & analysis
        e.g. Qwen 2.5 Coder 3B]
        RAG[RAG Agent
        semantic search & context retrieval
        e.g. mxbai-embed-large]
        TOOL[Tool Agent
        function calling & MCP
        e.g. Jan Nano 4B]
        COMP[Compressor Agent
        context summarization
        e.g. Qwen 2.5 0.5B]
    end
    subgraph Local LLM Runtime
        ENGINE[LLM Engine
        Ollama / Docker Model Runner]
    end
    CLI2 --> ORCH
    ORCH --> CHAT
    ORCH --> CODER
    ORCH --> RAG
    ORCH --> TOOL
    ORCH --> COMP
    CHAT --> ENGINE
    CODER --> ENGINE
    RAG --> ENGINE
    TOOL --> ENGINE
    COMP --> ENGINE
    style CLI2 fill:#4a9eff,stroke:#333,color:#fff
    style ORCH fill:#6c5ce7,stroke:#333,color:#fff
    style CHAT fill:#00b894,stroke:#333,color:#fff
    style CODER fill:#fdcb6e,stroke:#333,color:#333
    style RAG fill:#e17055,stroke:#333,color:#fff
    style TOOL fill:#0984e3,stroke:#333,color:#fff
    style COMP fill:#a29bfe,stroke:#333,color:#fff
    style ENGINE fill:#636e72,stroke:#333,color:#fff
```
Multiple small specialized models collaborate through an orchestrator. Each agent handles a specific task with the most appropriate model, maximizing performance within hardware constraints.
Of course, you can hybridize both approaches depending on your needs and constraints. But my scope for this blog post and the ones to follow remains development with local models only.
Conclusion
These elements are important to know because they will guide your choices of frameworks, application architecture, models to use, optimization techniques to adopt, etc. (that’s also what makes these kinds of projects interesting).
You’ll quickly understand that you’ll need to create multiple small specialized agents rather than a single big do-it-all one (doesn’t that remind you of a design pattern? “Low Coupling, High Cohesion” 🤓).
And you’ll need to get creative to work around the limitations of your local models (and your hardware).
This post is a write-up of lessons learned from my experiments with local models over the past 2 years. It’s also what drove me to develop my own tools to simplify agent creation and help compose agents into useful generative AI applications with local models (I’ll tell you about them in an upcoming blog post).